Zurich and Vaud by the Numbers - Predictive Insights into Tourism Dynamic

Authors
Affiliation

Valeriia Bilousko, Urs Hurni, Jayesh Smith

University of Lausanne

Anastasia Pushkarev, Emeline Raimon

Published

May 20, 2024

Abstract

The following Forecasting project focuses on …

1 Exploration & Visualization

1.1 Objectives

The main objectives of this project is to predict :

  • The overnight stays of the visitors in Vaud, from October 2023 until December 2024.

  • The overnight stays of visitors from Philippines to Zürich, from October 2023 until December 2024.

2 Data

2.1 Cleaning & Wrangling

Here we load the data and clean it. We will focus on the data for Vaud and Zurich, as well as the data for the Philippines.

Click to show code
# Load the data in folder data named Dataset_tourism.xlsx)
tourism_data <- readxl::read_xlsx(here("data/Dataset_tourism.xlsx"))

#removing value 'Herkunftsland - Total' in column 'Herkunftsland' as it is just the total
#tourism_data <- tourism_data %>% filter(Herkunftsland != "Herkunftsland - Total")
#print unique values in month column
unique(tourism_data$Monat)
#>  [1] "Januar"    "Februar"   "März"      "April"     "Mai"      
#>  [6] "Juni"      "Juli"      "August"    "September" "Oktober"  
#> [11] "November"  "Dezember"
# change ' [1] "Januar"    "Februar"   "März"      "April"     "Mai"       "Juni"      "Juli"      "August" "September" "Oktober"   "November"  "Dezember" into english month'
tourism_data$Monat <- tourism_data$Monat %>% recode_factor(
  "Januar" = "January",
  "Februar" = "February",
  "März" = "March",
  "April" = "April",
  "Mai" = "May",
  "Juni" = "June",
  "Juli" = "July",
  "August" = "August",
  "September" = "September",
  "Oktober" = "October",
  "November" = "November",
  "Dezember" = "December"
)
#add date type column for plotting purposes
tourism_data <- tourism_data %>% mutate(Date = dmy(paste("01", Monat, Jahr)))
# filtering out 'Herkunftsland - Total' in column 'Herkunftsland' as it is just the total
tourism_data_no_total <- tourism_data %>% filter(Herkunftsland != "Herkunftsland - Total")
#check for NAN
sum(is.na(tourism_data_no_total))
#> [1] 51395
#analyse the NAN values, where are they
(tourism_data_no_total %>% filter(is.na(value)))
#> # A tibble: 51,395 x 6
#>    Herkunftsland                  Kanton Monat Jahr  value Date      
#>    <chr>                          <chr>  <fct> <chr> <dbl> <date>    
#>  1 Malta                          Schwe~ Janu~ 2005     NA 2005-01-01
#>  2 Zypern                         Schwe~ Janu~ 2005     NA 2005-01-01
#>  3 Mexiko                         Schwe~ Janu~ 2005     NA 2005-01-01
#>  4 Übriges Zentralamerika, Karib~ Schwe~ Janu~ 2005     NA 2005-01-01
#>  5 Bahrain                        Schwe~ Janu~ 2005     NA 2005-01-01
#>  6 Katar                          Schwe~ Janu~ 2005     NA 2005-01-01
#>  7 Kuwait                         Schwe~ Janu~ 2005     NA 2005-01-01
#>  8 Australien                     Schwe~ Janu~ 2005     NA 2005-01-01
#>  9 Neuseeland, Ozeanien           Schwe~ Janu~ 2005     NA 2005-01-01
#> 10 Oman                           Schwe~ Janu~ 2005     NA 2005-01-01
#> # i 51,385 more rows

2.1.1 Tourism Data - General Overview

The dataset contains information about the overnights stays by tourists in the various Swiss cantons. It indicates the tourist’s country of origin, the canton of stay, the month, the year and the total number of overnights stays.

Check the appendix for more details on the table and cleaning process.

2.1.2 Tourism Data - Vaud

Given the two objectives of the project, we are going to filter the initial dataset in order to keep and analyse only the cantons of interest. We start by filtering the “Kanton” column to keep only the canton of Vaud.

Here are the first 1000 instances of the raw data :

Click to show code
# Filter by canton Vaud 
tourism_vaud <- tourism_data %>% filter(Kanton == "Vaud")
#check for NAN
sum(is.na(tourism_vaud))
#> [1] 1869
#show the data in a table using reactable
reactable(head(tourism_vaud, 1000), 
          searchable = TRUE,
          defaultPageSize = 5,
          sortable = TRUE,
          filterable = TRUE,
          pagination = TRUE,
          bordered = TRUE,
          highlight = TRUE,
          striped = TRUE,
          compact = TRUE,
          showPageSizeOptions = TRUE,
          showSortable = TRUE
          )

Note that no missing values were found in the dataset.

2.1.3 Tourism Data - Zurich and Philippines

The data Zurich contained the total number of overnight stays in Zurich by tourists from different countries. We are interested in the number of overnight stays by tourists from the Philippines. See appendix for more details on the table.

Thus, we are filtering the “Kanton” column and the ‘Herkunftsland’ column, keeping “Zurich” and “Philippinen” for the country of origin.

Here is all the instances of the raw filtered data :

Click to show code
#filter column 'Kanton' for Zurich
tourism_data_zurich <- tourism_data_no_total %>% filter(Kanton == "Zürich")
#check for NAN
sum(is.na(tourism_data_zurich))
#> [1] 1869
#analyse the NAN values, where are they
tourism_data_zurich %>% filter(is.na(value))
#> # A tibble: 1,869 x 6
#>    Herkunftsland                  Kanton Monat Jahr  value Date      
#>    <chr>                          <chr>  <fct> <chr> <dbl> <date>    
#>  1 Malta                          Zürich Janu~ 2005     NA 2005-01-01
#>  2 Zypern                         Zürich Janu~ 2005     NA 2005-01-01
#>  3 Mexiko                         Zürich Janu~ 2005     NA 2005-01-01
#>  4 Übriges Zentralamerika, Karib~ Zürich Janu~ 2005     NA 2005-01-01
#>  5 Bahrain                        Zürich Janu~ 2005     NA 2005-01-01
#>  6 Katar                          Zürich Janu~ 2005     NA 2005-01-01
#>  7 Kuwait                         Zürich Janu~ 2005     NA 2005-01-01
#>  8 Australien                     Zürich Janu~ 2005     NA 2005-01-01
#>  9 Neuseeland, Ozeanien           Zürich Janu~ 2005     NA 2005-01-01
#> 10 Oman                           Zürich Janu~ 2005     NA 2005-01-01
#> # i 1,859 more rows

tourism_data_zurich_philippines <- tourism_data_zurich %>% filter(Herkunftsland == "Philippinen")
#show table using reactable
reactable(tourism_data_zurich_philippines, 
          searchable = TRUE,
          defaultPageSize = 5,
          sortable = TRUE,
          filterable = TRUE,
          pagination = TRUE,
          bordered = TRUE,
          highlight = TRUE,
          striped = TRUE,
          compact = TRUE,
          showPageSizeOptions = TRUE,
          showSortable = TRUE
          )

Filtering for Philippinen solved the problem of missing data we had with all countries of origin. The overnight stays are all included throughout the period.

3 EDA - Vaud

3.1 International Visitors in Vaud

The graph shows the monthly number of overnight stays in Vaud from tourists of different countries. The period runs from January 2005 to September 2023.

Click to show code
# Create the ggplot object
plot_vaud <- tourism_vaud %>% 
  filter(Herkunftsland != 'Herkunftsland - Total') %>%
  ggplot(aes(x = Date, y = value, group = Herkunftsland, color = Herkunftsland,
             text = paste("Country:", Herkunftsland, "Trips:", value))) +  # Added text for tooltip
  geom_line(show.legend = FALSE) + 
  scale_color_viridis_d() +  # Use viridis color palette
  labs(title = "Number of visitors from Each Country to Vaud",
       x = "Date",
       y = "Number of Trips") +
  theme_minimal()

# Convert to an interactive plotly object with specified width and height
interactive_plot <- ggplotly(plot_vaud, tooltip = "text", width = 600, height = 400)


# Adjust plotly settings 
interactive_plot <- interactive_plot %>%
  layout(
    margin = list(l = 60, r = 60, b = 60, t = 80),
    showlegend = FALSE # Remove legend
  )

# Display the interactive plot
interactive_plot

The time plot reveals some interesting features. - According to the graph, tourists come mainly from Switzerland. The second visitor country is France. - There are large dips in the number of overnight stays at the beginning of each year – these are due to holiday effects. - There was an important drop during the period in 2020 – this was due to the COVID pandemic. - For swiss tourists, there is visible increasing trend before and after the pandemic.

This time plot takes the total number of tourists in the canton of Vaud, combining all countries of origin. Here, we can better observe the seasonal pattern in the data. The number of tourists decreases at the end and beginning of each year and increases in the middle of the year during the summer holidays. There is also an increasing trend pattern if we do not take into account the period of the pandemic in 2020 which caused an important drop in travel and therefore tourism in Vaud. We’ll come back to this outlier later. Any forecasts of this series would need to capture the seasonal pattern, and the fact that the trend is changing over the period.

Graphical view of total number of tourists in canton Vaud :

Click to show code
tourism_vaud_total <- tourism_vaud %>%
  filter(Herkunftsland == 'Herkunftsland - Total') %>%
  select(-c(Herkunftsland, Kanton, Monat, Jahr))

# Create the ggplot object with viridis color palette
plot_vaud_total <- tourism_vaud_total %>%
  ggplot(aes(x = Date, y = value)) +
  geom_line(color = viridis(1)) +  # Use viridis color palette for a single line
  labs(x = "Date", y = "Number of tourists", title = "Total number of tourists in canton Vaud") + 
  theme_minimal()

# Convert to an interactive plotly object with specified width and height
interactive_plot_total <- ggplotly(plot_vaud_total, width = 600, height = 400)

# Adjust plotly settings
interactive_plot_total <- interactive_plot_total %>%
  layout(
    margin = list(l = 60, r = 60, b = 60, t = 80),
    showlegend = FALSE  # Remove legend if any
  )

# Display the interactive plot
interactive_plot_total

3.1.1 Decomposition

We have process an additive decomposition of the time series into three components: trend, seasonality and residual. These components will allow us to understand how they contribute to the variations observed in Swiss tourism data.

Click to show code
# Convert data to a time series object
vaud_ts <- tourism_vaud_total %>%
  arrange(Date) %>%
   # Filtre pour enlever les valeurs NA dans 'Date'
  filter(!is.na(Date)) %>%
  # Ensure data is complete and monthly
  complete(Date = seq.Date(min(Date, na.rm = TRUE), max(Date, na.rm = TRUE), by = "month")) %>%
  replace_na(list(value = 0)) %>%  # Replace NA values if there are any
  # Create a time series object
  with(ts(value, frequency = 12, start = decimal_date(min(Date, na.rm = TRUE))))

# Perform STL decomposition
 stl(vaud_ts, s.window = "periodic") %>% 
   autoplot() +
  labs(x = "Date", y = "Number of tourists", title = "STL Decomposition for number of tourists in canton Vaud") +
  theme_minimal()

The main insights from this decomposition reflect what we have already observed.

  • A clear upward trend until around 2020, when it peaks before falling sharply as a result of the pandemic and travel restrictions.

  • Monthly seasonality, with clear and regular fluctuations due to seasonal factors.

  • A stable residual component until 2020. After this period, there is a slight increase in volatility which may indicate that other events are having an impact on this time series which are not captured by the first two components.

4 EDA - Zurich

As for Vaud, the most frequent visitors to Zurich are Swiss. Germany and United States are the two main foreign countries to visit Zurich. This can be explained by the fact that the canton of Zurich is closer to Germany and therefore easier to reach. The same applies to France with the canton of Vaud. The yellow curve represents the Philippines. The curve is flat and shows a considerably small number of trips from this country over the period. There is a drastic fall in 2020 caused by COVID-19. The pandemic has had a significant impact on the tourism industry worldwide. At first glance, there are regular seasonal peaks for most countries which also suggest the presence of seasonality in tourism in the canton of Zurich. check the Appendix for more details with a Overall Zurichs Visitors graph

4.1 Zurich and Philippines Visitors

This graph shows only visitors from the Philippines, as this is the country of interest in our analysis.

Click to show code
# Use tourism_data_zurich_philippines data to plot the values in y axis and Date in x axis
p <- ggplot(tourism_data_zurich_philippines, aes(x = Date, y = value)) +
  geom_line(color = viridis(1)) +  # Use viridis color palette for a single line
  labs(title = "Number of Trips from Philippines to Zurich",
       x = "Date",
       y = "Number of Trips") +
  theme_minimal()

# Convert to an interactive plotly object with specified width and height
interactive_plot <- ggplotly(p, width = 600, height = 400)

# Adjust plotly settings
interactive_plot <- interactive_plot %>%
  layout(
    showlegend = FALSE  # Remove legend if any
  )

# Display the interactive plot
interactive_plot

4.1.1 Pattern

We choose a multiplicative model for the STL decomposition of the number of trips from the Philippines to Zurich because the data (previous plot) shows increasing variance and larger fluctuations as the number of trips grows, suggesting that seasonal and trend components scale proportionally with the overall level of the series. check the appendix for the additive decomposition and further seasonal graph

4.1.1.1 Decomposition

Click to show code
# Convert data to a time series object
tourism_ts <- tourism_data_zurich_philippines %>%
  arrange(Date) %>%
  # Ensure data is complete and monthly
  complete(Date = seq.Date(min(Date), max(Date), by = "month")) %>%
  replace_na(list(value = 0)) %>%  # Replace NA values if there are any
  # Create a time series object
  with(ts(log(value), frequency = 12, start = decimal_date(min(Date))))

# Perform STL decomposition on the transformed data
tourism_stl <- stl(tourism_ts, s.window = "periodic")

# Plotting the results (transforming back by exponentiating)
autoplot(tourism_stl) +
  labs(x = "Date", y = "Number of tourists", title = "Multiplicative STL Decomposition for number of tourists from Philippines to Zurich") +
  theme_minimal()

  • An upward trend until around 2020, when it falls sharply because of the pandemic and travel restrictions. The pandemic had a longer effect on the Philippine tourism, which stopped for a longer period (around 2 years or more).
  • Multiple peaks in the seasonal monthly component. These fluctuations are due to their calendar. Philippines start their summer holidays earlier than we do (31 of May - 29 of July) and have longer La Toussaint holidays (5 October - 18 October - 28 October).
  • A residual component with moderate variability which increases from 2020 onwards, indicating the influence of unforeseen or exceptional events (such as the pandemic) that have disrupted the usual models.

4.1.1.2 Seasonality

Click to show code
# several chart per month to see the seasonality
ggsubseriesplot(tourism_ts) + ylab("Number of tourists") + xlab("Month") + ggtitle("Seasonal subseries plot")

#debug
#better to use gg_subseries to see the seasonality
#tourism_ts %>% gg_subseries(value) + ylab("Number of tourists") + xlab("Month") + ggtitle("Seasonal subseries plot")

The months of May to July and October seem to have visitor peaks, which may indicate a high tourist season during this period. As we saw before, this is due to their calendar. The years 2022 and 2023 show a significant increase in visitor numbers compared with previous years. In particular, the months from May to October 2022 and 2023 show much higher values. This growth may be due to several factors, such as a post-pandemic recovery in travel or specific initiatives that have attracted more tourists.

4.1.1.3 Trend

There is an upward trend until around 2020, when it falls sharply because of the pandemic and travel restrictions. The pandemic had a longer effect on the Philippine tourism, which stopped for a longer period (around 2 years or more).

5 Modelling

In this section, we will apply forecasting techniques to model tourism data for Zurich, aiming to predict future trends. This involves carefully choosing aggregation levels for hierarchical time series to accurately capture data structure.

We build and justify various time series models, considering seasonality, outliers, collinearity, covariates, and special events. Each model is evaluated for its suitability and performance in handling the data’s complexities.

Finally, we select the best model based on a detailed assessment of forecasting accuracy, aiming to turn historical data into actionable insights for strategic decision-making in Zurich’s tourism sector.

5.1 Total number of visitors in Vaud

In this section we will focus on the total number of visitors in Vaud. We will use the ETS and ARIMA models to forecast the number of visitors in Vaud over a 15-month horizon.

5.1.1 ETS

The ETS (Error, Trend, Seasonality) forecast model presents numerous advantages in predicting the number of visitors to canton Vaud over a 15-month horizon due to its ease of implementation and reliability of results. The model is based on using trend and seasonality of past data to predict future without request of extra features.

Click to show code
ets_vaud <- ets(vaud_ts, model = "AAA")
forecast_ets_vaud <- forecast(ets_vaud, h = 24) %>% plot(main = "Forecast of visitors in Vaud", xlab = "Date", ylab = "Number of visitors")

Employing the ETS “AAA” model, characterized by additive consideration of error, trend, and seasonality components, we forecast number of visitors in canton Vaud for 15 months. The forecast generated by the ETS model appears reasonable, it follows trend and seasonality; however, the wide range of the confidence interval indicates a high level of uncertainty surrounding the predictions.

5.1.2 ARIMA

The automatic ARIMA (AutoRegressive Integrated Moving Average) model offers a data-driven approach to predict future trends. By automatically selecting the optimal parameters for the ARIMA model, including the order of autoregressive, differencing, and moving average components, this method simplifies the forecasting process while still capturing the underlying patterns in the data. Leveraging historical data, the automatic ARIMA model generates forecasts that account for both trend and seasonality.

Click to show code
# Convert your time series data to a tsibble object
vaud_tsibble <- as_tsibble(vaud_ts)

# Fit ARIMA model to the data
fit_vaud <- vaud_tsibble %>%
  model(auto_arima = ARIMA(value, stepwise = FALSE, approximation = FALSE))

# Generate forecasts for the next 2 years (24 months)
forecast_vaud <- fit_vaud %>%
  forecast(h = 15)

# Plot the forecast
forecast_vaud %>%
  autoplot(data = vaud_tsibble, main = "ARIMA Forecast for Vaud Tourism", ylab = "Number of Tourists")
#> Warning in ggdist::geom_lineribbon(without(intvl_mapping,
#> "colour_ramp"), : Ignoring unknown parameters: `main` and `ylab`
#> Warning in geom_line(mapping = without(mapping, "shape"), data =
#> unpack_data(object[single_row[["FALSE"]], : Ignoring unknown
#> parameters: `main` and `ylab`

#Provide forecast in table
as.data.frame(forecast_vaud) %>% kable(caption = "Forecast for Vaud Tourism") %>%
  kable_styling(full_width = FALSE)
Forecast for Vaud Tourism
.model index value .mean
auto_arima 2023 Oct N(133902, 8e+07) 133902
auto_arima 2023 Nov N(109552, 1.6e+08) 109552
auto_arima 2023 Dec N(114292, 2.1e+08) 114292
auto_arima 2024 Jan N(95066, 2.4e+08) 95066
auto_arima 2024 Feb N(1e+05, 2.5e+08) 102745
auto_arima 2024 Mar N(106504, 2.6e+08) 106504
auto_arima 2024 Apr N(107316, 2.7e+08) 107316
auto_arima 2024 May N(131745, 2.9e+08) 131745
auto_arima 2024 Jun N(142550, 3.1e+08) 142550
auto_arima 2024 Jul N(161690, 3.2e+08) 161690
auto_arima 2024 Aug N(156417, 3.3e+08) 156417
auto_arima 2024 Sep N(144912, 3.4e+08) 144912
auto_arima 2024 Oct N(124522, 3.6e+08) 124522
auto_arima 2024 Nov N(99671, 3.8e+08) 99671
auto_arima 2024 Dec N(105088, 4e+08) 105088

Automatically the model ARIMA(5,0,0)(0,1,1) was chosen. The graph shows that the seasonality observed in the historical data continues to manifest itself in the future forecasts, with recurring seasonal peaks and troughs. This demonstrates the robustness of the ARIMA model Forecasts for the next two years (15 months) are represented by the blue line. The violet bands around the blue line represent the forecast confidence interval, indicating the range of values within which future values are likely to lie with a certain probability. Due to the drop observed in 2020, the trend is showing a slight downward trajectory, however, currently, there are no evident reasons for a decrease in the number of visitors.

To capture the positive trend we use model ARIMA (5,1,0)(0,1,1). Changing models explicitly using a differencing order of 1 in the non-seasonal component provide better flexibility in capturing any underlying trend patterns in the data. This adjustment allows the model to more effectively account for changes in the level of the time series data over time, potentially leading to improved accuracy in forecasting future trends in tourist arrivals in Canton Vaud.

Click to show code
# Fit ARIMA model with specified parameters
arima_model <- arima(vaud_ts, order = c(5, 1, 0), seasonal = list(order = c(0, 1, 1), period = 12))

forecast_a_vaud <- arima_model %>%
  forecast(h = 15)

# Plot the forecast
forecast_a_vaud %>%
  autoplot(data = vaud_tsibble, main = "ARIMA Forecast for Vaud Tourism", ylab = "Number of Tourists")

#Provide forecast in table
as.data.frame(forecast_a_vaud) %>% kable(caption = "Forecast for Vaud Tourism") %>%
  kable_styling(full_width = FALSE)
Forecast for Vaud Tourism
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Oct 2023 135748 124322 147173 118273 153222
Nov 2023 113990 97868 130113 89333 138648
Dec 2023 120876 101978 139773 91974 149777
Jan 2024 103671 83059 124283 72148 135194
Feb 2024 111674 90386 132962 79117 144231
Mar 2024 116276 94369 138183 82773 149779
Apr 2024 117674 94905 140443 82852 152496
May 2024 143215 119344 167085 106708 179721
Jun 2024 155473 130321 180626 117006 193941
Jul 2024 175637 149289 201986 135341 215934
Aug 2024 171198 143848 198548 129370 213026
Sep 2024 160363 132151 188574 117217 203508
Oct 2024 140898 111269 170528 95584 186213
Nov 2024 117389 86377 148401 69960 164817
Dec 2024 123921 91608 156234 74503 173339

The ARIMA(5,1,0)(0,1,1) model effectively captures both the positive trend and seasonality present in the data, resulting in a realistic forecasting scenario. The approach results in forecasts that closely align with observed patterns, leading to a narrow confidence interval enhancing the reliability of predictions for tourist arrivals in canton Vaud.

5.2 Zurich and Philippines

In this section, we will focus on the number of visitors from the Philippines to Zurich. We will use the Naive as a benchmark for ETS and ARIMA models to forecast the number of visitors from the Philippines to Zurich over a 15-month horizon.

5.2.1 Forecast without dealing with Covid

As we have seen, Covid has had a significant impact on the number of tourists in Zurich. We will first forecast the number of tourists without taking into account the Covid period. This will allow us to see the impact of the pandemic on the forecasts.

5.2.1.1 ETS

We first run a NAIVE model to have a benchmark for the other model. And a AAA model for the ETS model for ensuring that the model is indeed worse than a multiplicative one, based on our previous observations. see appendix for the plots and metrics of those 2 models

Click to show code
###### NAIVE part #####
#convert tourism_ts to tsibble
tourism_ts <- tourism_ts %>% as_tsibble()
# Fit a naive model
fit <- tourism_ts %>%
  model(NAIVE = NAIVE(value))
# Forecast the next 2 years periods
forecast <- fit %>%
  forecast(h = 15)
# Plot the forecasts along with the historical data, and make the colors of the forecast a bit more transparent for distinguishably purposes
plot_naive <- forecast %>%
  autoplot(tourism_ts, alpha = 0.5) +
  labs(title = "Forecast of tourists from Philippines to Zurich",
       x = "Date",
       y = "Number of tourists") + guides(colour = guide_legend(title = "Forecast"))

metrics_naive <- fit %>% accuracy()
# Display accuracy metrics in an HTML table
metrics_naive <- metrics_naive %>%
  kable("html", caption = "Naive Model Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

###### ETS PART AAA #####

# Fit an ETS model
# Adjusting the model parameters according to the characteristics of the data
# Here "A" means additive error, "N" means no trend, and "N" means no seasonality
# change these if needed
fit <- tourism_ts %>%
  model(ETS_M_seaso = ETS(value ~ error("A") + trend("A") + season("A")))
# Forecast the next 2 years periods
forecast <- fit %>%
  forecast(h = 15)
# Plot the forecasts along with the historical data, and make the colors of the forecast a bit more transparent for distinguishably purposes
plot_ets_AAA <- forecast %>%
  autoplot(tourism_ts, alpha = 0.5) +
  labs(title = "Forecast of tourists from Philippines to Zurich",
       x = "Date",
       y = "Number of tourists") + guides(colour = guide_legend(title = "Forecast"))

# Calculate the accuracy of the training set
metrics_ets_AAA <- fit %>% accuracy() %>%
  kable("html", caption = "ETS AAA Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Now we will forecast the number of tourists from the Philippines to Zurich using the ETS AAM AND AAdM models.

Click to show code
# comparing several model
fit <- tourism_ts %>%
  model(
        ETS_M_seaso = ETS(value ~ error("A") + trend("A") + season("M")), #multiplicative seasonality
        ETS_M_seaso_Ad = ETS(value ~ error("A") + trend("Ad") + season("M")), #dampted trend
  )

# Forecast the next 2 years periods
forecast <- fit %>%
  forecast(h = 15)

# Extract the forecast for ETS_M_seaso_Ad
forecast_ETS_M_seaso_Ad <- forecast %>%
  filter(.model == "ETS_M_seaso_Ad") %>%
  as.data.frame()
# Plot the forecasts along with the historical data, and make the colors of the forecast a bit more transparent for distinguishably purposes
plot <- forecast %>%
  autoplot(tourism_ts, level = 90, color = "blue", alpha = 0.5) +
  labs(title = "Forecast of tourists from Philippines to Zurich",
       x = "Date",
       y = "Number of tourists") + guides(colour = guide_legend(title = "Forecast"))
plot

We can see here the metrics of the two ETS models (AAM and AAdM) :

Click to show code
# Extract AIC values
aic_values <- fit %>%
  glance() %>%
  select(.model, AIC)

# Print the AIC values
print(aic_values)
#> # A tibble: 2 x 2
#>   .model           AIC
#>   <chr>          <dbl>
#> 1 ETS_M_seaso     887.
#> 2 ETS_M_seaso_Ad  897.

# Display AIC values with forecast metrics
metrics <- fit %>% accuracy()
metrics_with_aic <- metrics %>%
  left_join(aic_values, by = ".model")

# Display metrics with AIC in an HTML table
metrics_ets <- metrics_with_aic %>%
  kable("html", caption = "Model Accuracy Metrics with AIC Values") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

metrics_ets
Model Accuracy Metrics with AIC Values
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1 AIC
ETS_M_seaso Training 0.003 0.443 0.287 -1.52 7.69 0.430 0.385 0.057 887
ETS_M_seaso_Ad Training 0.008 0.452 0.300 -1.45 7.92 0.449 0.392 0.048 897

We observe that the damped model is better on almost all metrics.

5.2.1.2 ARIMA

We fit 3 different models

  • one with auto-arima
  • Anoter one with auto-arima but with seasonality added
  • And one with auto-arima but with stepwise and approximation options turned off for a more thorough search.

NOTE A REFORMULER PLUS PETIT Reasons to Turn Off Stepwise and Approximation Accuracy Over Speed: If computational resources and time are not a constraint, turning off these options can lead to more accurate and reliable model selection and parameter estimation. Better Model Selection: An exhaustive search (with stepwise = FALSE) and exact estimation (with approximation = FALSE) increase the chances of identifying the true underlying model, which can improve forecast accuracy. Complex Data: For datasets with complex patterns or when high accuracy is critical, the benefits of thorough model exploration and exact estimation outweigh the increased computational cost.

Click to show code
# Fit an automatic ARIMA model
fit_arima <- tourism_ts %>%
  model(ARIMA_auto = ARIMA(value),
        ARIMA_auto_seasonal = ARIMA(value~season()),
        ARIMA_auto_stepwise = ARIMA(value, stepwise = FALSE, approximation = FALSE))

# Forecast the next 15 months
forecast_arima <- fit_arima %>%
  forecast(h = 15)

# Plot the forecasts along with the historical data
plot_arima <- forecast_arima %>%
  autoplot(tourism_ts, alpha = 0.3) +
  labs(title = "ARIMA Forecast of Tourists from the Philippines to Zurich",
       x = "Date",
       y = "Number of Tourists") +
  guides(colour = guide_legend(title = "Forecast"))
plot_arima

We see that that the Auto-ARIMA with seasonality is more negative than the two others and that the stepwise and approximation options turned off is more positive. The auto-arima alone is in between both.

Here are the metrics of the ARIMA model :

Click to show code
# Extract AIC values
aic_values <- fit_arima %>%
  glance() %>%
  select(.model, AIC)

# Calculate the accuracy of the training set
metrics_arima <- fit_arima %>% accuracy()

# Combine accuracy metrics with AIC values
metrics_arima <- metrics_arima %>%
  left_join(aic_values, by = ".model")

# Display accuracy metrics with AIC values in an HTML table
metrics_arima_table <- metrics_arima %>%
  kable("html", caption = "Model Accuracy Metrics with AIC Values") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

metrics_arima_table
Model Accuracy Metrics with AIC Values
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1 AIC
ARIMA_auto Training 0.013 0.450 0.305 -1.28 7.99 0.457 0.391 0.03 293
ARIMA_auto_seasonal Training 0.011 0.435 0.282 -1.26 7.46 0.422 0.378 0.01 297
ARIMA_auto_stepwise Training 0.013 0.450 0.305 -1.28 7.99 0.457 0.391 0.03 293

We observe with the metrics that the more complex model is the best by almost metrics but followed very closely by the auto-arima with seasonality model. The lowest AIC for the more complex model is a good sign that it is the best model.

5.2.2 Forecasting Amidst the Covid-19 Pandemic

Now after those forecast we will take into account the Covid period and see if it will provide a better forecast.

5.2.2.1 How Philippines dealt with it

In 2021, the Philippines faced the challenges of the COVID-19 pandemic with a mix of resilience and adaptation.

CHANGE THIS TEXT ITS BULLSHIT ***** - Lockdowns and Waves: The country experienced two waves of COVID-19, leading to prolonged lockdowns throughout the year. These restrictions aimed to curb the spread of the virus. - Vaccination Campaign: Despite challenges, millions of Filipinos received COVID-19 vaccines. However, experts raised concerns about the campaign’s sluggish pace. - Senate Investigations: Lawmakers probed alleged anomalies in pandemic contracts, leading to marathon Senate hearings. - Easing Restrictions: Towards the end of the year, daily cases dropped, and mandatory face shield policies were lifted. This signaled progress in overcoming the crisis. - Risk-Based Decisions: While the holidays looked promising, the threat of new variants remained. Experts advised practicing “risk-based” decisions for activities despite low case numbers1. - Filipinos have become more mindful of hygiene practices, including social distancing, mask-wearing, and proper handwashing. The pandemic also prompted a shift in consumption patterns, with increased focus on essentials and at-home entertainment. However, air travel remains limited due to ongoing concerns.

As for travel, the Philippines continues to navigate the balance between safety and economic recovery. While some travel restrictions have eased, travellers must stay informed about evolving guidelines and exercise caution when planning trips. The threat of new variants underscores the need for vigilance and informed decision-making1 ******

Question : How to deal with this blackswan event ?

5.2.2.2 Covariates

TEXT A PAUFINER Covariate Adjustment Adjust your forecasts to account for the impact of COVID-19 by including covariates that capture the effects of the pandemic. These covariates can be used to adjust the forecasts based on the observed changes in the data due to COVID-19. For example, you can include a covariate that captures the effect of lockdowns or travel restrictions on tourism data. Scenario Analysis Conduct scenario analysis to explore the potential impact of different COVID-19 scenarios on your forecasts. By considering a range of possible outcomes, you can better prepare for the uncertainties associated with the pandemic. Sensitivity Analysis Evaluate the sensitivity of your forecasts to changes in key assumptions or parameters. By conducting sensitivity analysis, you can identify the factors that have the greatest impact on your forecasts and assess the robustness of your models.

5.2.2.3 Data Integration

The Oxford COVID-19 Government Response Tracker (OxCGRT) provides valuable data on government responses to the COVID-19 pandemic, including a Stringency Index that quantifies the severity of lockdown measures. Let me provide more details about this index:

Stringency Index: The Stringency Index is a composite measure that evaluates the strictness of government policies related to COVID-19. It calculates a score based on nine key response metrics: - School closures - Workplace closures - Cancellation of public events - Restrictions on public gatherings - Closures of public transport - Stay-at-home requirements - Public information campaigns - Restrictions on internal movements - International travel controls Each metric is assigned a value between 0 and 100, with a higher score indicating a stricter response. The overall Stringency Index is the mean score of these nine metrics. If policies vary at the subnational level, the index reflects the strictest sub-region’s response level. Importantly, the Stringency Index records the strictness of government policies but does not measure or imply the appropriateness or effectiveness of a country’s response. A higher score does not necessarily mean that a country’s response is better than others lower on the index.

source - Our World In Data

Here you can observe the stringency index for the Philippines and Switzerland :

Click to show code
# Convert data to a time series object
tourism_ts <- tourism_data_zurich_philippines
#rename Date as date
names(tourism_ts)[names(tourism_ts) == "Date"] <- "date"
  
#read .csv with stringency index
stringency_df <- read.csv(here("data/stringency_index.csv"))

# Filter data by location
stringency_philippines <- filter(stringency_df, location == "Philippines")
stringency_switzerland <- filter(stringency_df, location == "Switzerland")

# Convert dates and set them to the first day of the month
stringency_philippines$date <- as.Date(format(dmy(stringency_philippines$date), "%Y-%m-01"))
stringency_switzerland$date <- as.Date(format(dmy(stringency_switzerland$date), "%Y-%m-01"))

# Aggregate to monthly average, ensuring date format is maintained
stringency_philippines <- stringency_philippines %>%
  group_by(date) %>%
  summarize(avg_stringency_index = mean(stringency_index, na.rm = TRUE))

stringency_switzerland <- stringency_switzerland %>%
  group_by(date) %>%
  summarize(avg_stringency_index = mean(stringency_index, na.rm = TRUE))

# Merge with Philippines data first
merged_data_philippines <- merge(tourism_ts, stringency_philippines, by = "date", all.x = TRUE)

# Then merge with Switzerland data
merged_data <- merge(merged_data_philippines, stringency_switzerland, by = "date", all.x = TRUE, suffixes = c("_PH", "_SW"))

# Replace NA values in avg_stringency_index with 0 if necessary
merged_data$avg_stringency_index_PH[is.na(merged_data$avg_stringency_index_PH)] <- 0
merged_data$avg_stringency_index_SW[is.na(merged_data$avg_stringency_index_SW)] <- 0


# Create a ggplot of the stringency index
ggplot(merged_data, aes(x = date, y = avg_stringency_index_PH, color = "Philippines")) +
  geom_line() +
  geom_line(aes(y = avg_stringency_index_SW, color = "Switzerland")) +
  labs(title = "Stringency Index in the Philippines and Switzerland",
       x = "Date",
       y = "Stringency Index") +
  scale_color_manual(values = c("#3C5B6F", "darkred"),
                     labels = c("Philippines", "Switzerland"))

We see that Philippines had a higher stringency index than Switzerland.

You can observe here the resulted data from the merge of the stringency index and the tourism data :

Click to show code
#show merge data using reactable
reactable(merged_data, 
          searchable = TRUE,
          defaultPageSize = 5,
          sortable = TRUE,
          filterable = TRUE,
          pagination = TRUE,
          bordered = TRUE,
          highlight = TRUE,
          striped = TRUE,
          compact = TRUE,
          showPageSizeOptions = TRUE,
          showSortable = TRUE
          )

5.2.2.4 ARIMAX

We choose the ARIMAX as multiple regression model did not provide satisfying results. see appendix

We try here to integrate the stringency index as an exogenous variable in the ARIMA model. This will allow us to account for the impact of COVID-19 on the number of tourists from the Philippines to Zurich.

We fix seasonality as TRUE and stepwise and approximation as FALSE to have a more thorough search for the best model, as we saw before this provided better results.

Click to show code
# Transform to a time series object with frequency 12 (monthly data)
tourism_ts <- ts(merged_data$value, frequency = 12, start = c(2005, 1))  # Adjust the start time as per your data

# Ensure exogenous variables have the same length and frequency
exog_data <- cbind(merged_data$avg_stringency_index_PH, merged_data$avg_stringency_index_SW)

# Check if lengths of tourism_ts and exog_data match
if (length(tourism_ts) == nrow(exog_data)) {
  # Fit an ARIMAX model
  model <- auto.arima(tourism_ts, xreg = exog_data, seasonal = TRUE, stepwise = FALSE, approximation = FALSE)
  
  # Summary of the model
  summary(model)
  
  # Forecast the next 15 months, setting future exogenous variables to 0
  future_exog <- matrix(0, nrow = 15, ncol = 2)
  
  forecast_values <- forecast(model, xreg = future_exog, h = 24)
  
  # Convert the forecast to a data frame
  forecast_arimax <- as.data.frame(forecast_values)
  
  # Rename the columns for clarity
  colnames(forecast_arimax) <- c("Point Forecast", "Lo 80", "Hi 80",
                                 "Lo 95", "Hi 95")
  forecast_arimax$Date <- seq(as.Date("2023-10-01"), by = "month",
                              length.out = 15)
  # Plot the forecast along with the actual data using autoplot from the forecast package
  plot_forecast <- autoplot(forecast_values, series = "Forecast") +
    autolayer(tourism_ts, series = "Actual Data") +
    labs(title = "ARIMAX Forecast of Tourists from the Philippines to Zurich",
         x = "Date",
         y = "Number of Tourists") +
    guides(colour = guide_legend(title = "Data Type"))
  
  plot_forecast
    # Calculate evaluation metrics on the training data
  residuals <- residuals(model)
  mae <- mean(abs(residuals))
  mape <- mean(abs(residuals / tourism_ts)) * 100
  rmse <- sqrt(mean(residuals^2))
  aic <- AIC(model)
  bic <- BIC(model)
  
  # Print evaluation metrics
  cat("Evaluation Metrics:\n")
  cat("MAE:", mae, "\n")
  cat("MAPE:", mape, "\n")
  cat("RMSE:", rmse, "\n")
  cat("AIC:", aic, "\n")
  cat("BIC:", bic, "\n")
} else {
  stop("Length of tourism_ts and exog_data do not match. Please check the data.")
}
#> Evaluation Metrics:
#> MAE: 65 
#> MAPE: 74.9 
#> RMSE: 105 
#> AIC: 2744 
#> BIC: 2774

Here are the metrics of the ARIMAX model with the metrics of the ARIMA model for comparison :

Click to show code
#show metric in html
model_metrics <- data.frame(
  Model = "ARIMAX",
  MAE = mae,
  MAPE = mape,
  RMSE = rmse,
  AIC = aic,
  BIC = bic
)

#show html table with metrics
metrics_arimax <- model_metrics %>%
  kable("html", caption = "Model Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
metrics_arimax
Model Metrics
Model MAE MAPE RMSE AIC BIC
ARIMAX 65 74.9 105 2744 2774
Click to show code
metrics_arima_table
Model Accuracy Metrics with AIC Values
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1 AIC
ARIMA_auto Training 0.013 0.450 0.305 -1.28 7.99 0.457 0.391 0.03 293
ARIMA_auto_seasonal Training 0.011 0.435 0.282 -1.26 7.46 0.422 0.378 0.01 297
ARIMA_auto_stepwise Training 0.013 0.450 0.305 -1.28 7.99 0.457 0.391 0.03 293

Summary a reformuler Best in MAE and MAPE: ARIMAX Best in RMSE: ARIMA_auto_stepwise Best in AIC: ARIMA_auto_stepwise Overall, ARIMA_auto_stepwise shows the best balance in terms of RMSE and AIC, while ARIMAX excels in MAE and MAPE. Depending on which metric is prioritized, either ARIMAX or ARIMA_auto_stepwise would be the preferred model.

5.2.2.5 Other ideas

A Ddeveloper, pq on a pas choisi, bblabla, … - Dummy variable for the covid period. We could have a dummy variable that takes the value 1 during the covid period and 0 otherwise. This would allow the model to adjust the forecasts to reflect the changes in the data due to COVID-19. - Replacing covid values with the ARIMA, If the missing values are random or if excluding them would result in a loss of valuable information, we might consider imputing them. One common approach is to use statistical models like ARIMA to interpolate missing values based on the patterns observed in the available data. - Delete the timestamp of the covid period and use fill_gaps() to fill the missing values and then use the a model to predict the missing values.

6 Model Selection

6.1 Vaud

Implement model selection metrics comparison,blabla and logic behind it

6.2 Zurich and Philippines

To reformulate !! Use Information Criteria for Model Comparison: Evaluate models based on criteria such as RMSE, MAE, MAPE, AIC (Akaike Information Criterion) or BIC (Bayesian Information Criterion), which help in selecting a model that balances goodness of fit with complexity.

Click to show code
metrics_naive
Naive Model Metrics
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
NAIVE Training 0.012 0.5 0.369 -1.03 8.9 0.553 0.434 -0.103
Click to show code
metrics_ets
Model Accuracy Metrics with AIC Values
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1 AIC
ETS_M_seaso Training 0.003 0.443 0.287 -1.52 7.69 0.430 0.385 0.057 887
ETS_M_seaso_Ad Training 0.008 0.452 0.300 -1.45 7.92 0.449 0.392 0.048 897
Click to show code
metrics_arima_table
Model Accuracy Metrics with AIC Values
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1 AIC
ARIMA_auto Training 0.013 0.450 0.305 -1.28 7.99 0.457 0.391 0.03 293
ARIMA_auto_seasonal Training 0.011 0.435 0.282 -1.26 7.46 0.422 0.378 0.01 297
ARIMA_auto_stepwise Training 0.013 0.450 0.305 -1.28 7.99 0.457 0.391 0.03 293
Click to show code
metrics_arimax
Model Metrics
Model MAE MAPE RMSE AIC BIC
ARIMAX 65 74.9 105 2744 2774

Based on the comparison of the metrics:

  • ETS_M_seaso_Ad stands out with the lowest MAE (48.9) and RMSE (74.7), making it the most accurate model in terms of absolute and square error metrics. However, its MAPE (105) is not as low as the ARIMAX model’s MAPE (74.9).
  • ARIMAX has a competitive AIC (2744) and a lower MAPE (74.9) compared to the ETS models, but a higher MAE (65) and RMSE (105) than ETS_M_seaso_Ad.

Given that the primary goal is to predict the number of visitors accurately:

  • ETS_M_seaso_Ad is recommended because it has the lowest MAE and RMSE, indicating more precise predictions, which are crucial for ensuring accurate and actionable insights for decision-making in Zurich’s tourism sector.
  • Although the ARIMAX model has a competitive AIC and a reasonable MAPE, its MAE and RMSE are not as competitive as the ETS_M_seaso_Ad model.

Using both forecast hand in hand could be a better approach to have a more robust forecast than selecting only one model.

Here are the forecasted values for the two models :

Click to show code
# Assuming forecast_arimax and forecast_ETS_M_seaso_Ad are already available

# Rename the columns for clarity
colnames(forecast_arimax) <- c("ARIMAX_Forecast", "ARIMAX_Lo_80", "ARIMAX_Hi_80", "ARIMAX_Lo_95", "ARIMAX_Hi_95", "Date")

# Convert the 'Date' column to Date type for merging
forecast_arimax$Date <- as.Date(forecast_arimax$Date)

# Extract the relevant columns from forecast_ETS_M_seaso_Ad
forecast_ETS_M_seaso_Ad <- forecast_ETS_M_seaso_Ad %>%
  select(Date = index, ETS_M_seaso_Ad_Forecast = .mean)

# Convert the 'Date' column to Date type for merging
forecast_ETS_M_seaso_Ad$Date <- as.Date(paste(forecast_ETS_M_seaso_Ad$Date, "01", sep = "-"), format = "%Y %b-%d")

# Merge the two data frames based on the Date
combined_forecast <- merge(forecast_arimax, forecast_ETS_M_seaso_Ad, by = "Date")

# Calculate the delta between the two forecasts
combined_forecast <- combined_forecast %>%
  mutate(Delta = ETS_M_seaso_Ad_Forecast - ARIMAX_Forecast)

# Remove the unnecessary columns
combined_forecast <- combined_forecast %>%
  select(Date, ARIMAX_Forecast, ETS_M_seaso_Ad_Forecast, Delta)

# Display the combined table using reactable
reactable(
  combined_forecast,
  columns = list(
    Date = colDef(name = "Date"),
    ARIMAX_Forecast = colDef(
      name = "ARIMAX Forecast",
      format = colFormat(digits = 2)
    ),
    ETS_M_seaso_Ad_Forecast = colDef(
      name = "ETS_M_seaso_Ad Forecast",
      format = colFormat(digits = 2)
    ),
    Delta = colDef(
      name = "Delta",
      format = colFormat(digits = 2)
    )
  ),
  defaultPageSize = 5,
  highlight = TRUE,
  striped = TRUE,
  bordered = TRUE,
  filterable = TRUE,
  searchable = TRUE,
  sortable = TRUE,
  resizable = TRUE
)

7 Appendix

7.1 General Overview

As the dataset provided is in German, we have translated the data in English to make it more intuitive and understandable for everyone. Then, we created a new ‘Date’ column, year-month-day, which corresponds to the correct format to be able to make predictions.

Here are the first 1000 instances of the raw data :

Click to show code
#show data using reactable only showing the first 100 rows
reactable(head(tourism_data_no_total, 1000), 
          searchable = TRUE,
          defaultPageSize = 5,
          sortable = TRUE,
          filterable = TRUE,
          pagination = TRUE,
          bordered = TRUE,
          highlight = TRUE,
          striped = TRUE,
          compact = TRUE,
          showPageSizeOptions = TRUE,
          showSortable = TRUE
          )

7.2 ZH - Tourism Data

We filtered the “Kanton” column to keep only the canton of Zurich.

Here are the first 1000 instances of the raw data :

Click to show code
#show the data in a table using reactable
reactable(head(tourism_data_zurich, 1000), 
          searchable = TRUE,
          defaultPageSize = 5,
          sortable = TRUE,
          filterable = TRUE,
          pagination = TRUE,
          bordered = TRUE,
          highlight = TRUE,
          striped = TRUE,
          compact = TRUE,
          showPageSizeOptions = TRUE,
          showSortable = TRUE
          )

There are 1869 missing values for the two sub-datasets. These missing values come from the ‘value’ column, creating gaps in the time series. We’ll see later how we’re going to process them to do modelling.

7.3 ZH -Zurich and All visiting countries

The graph shows the monthly number of overnight stays in Zurich from tourists of different countries.

Click to show code
# Preparing the data
#removing value in column 'Herkunftsland' as it is just the whole of Switzerland
data <- tourism_data_zurich %>%
  filter(!is.na(value)) %>%  # Removing rows with NA values in the 'value' column
  mutate(Monat = month(Date, label = TRUE, abbr = TRUE),  # Extract month 
         Jahr = year(Date)) %>%  # Extract year from Date
  group_by(Herkunftsland, Date) %>%  # Group by country and date
  summarise(Trips = sum(value), .groups = 'drop')  # Summing up trips for each country per date

p <- ggplot(data, aes(x = Date, y = Trips, group = Herkunftsland,
                      color = Herkunftsland == "Philippinen",
                      text = paste("Country:", Herkunftsland, "<br>Trips:", Trips))) +  # Added text for tooltip
  geom_line(show.legend = FALSE) +
    scale_color_viridis_d() +  # Use viridis color palette
  labs(title = "Number of Trips from Each Country to Zurich",
       x = "Date",
       y = "Number of Trips") +
  theme_minimal()

# Convert to an interactive plotly object
interactive_plot <- ggplotly(p, tooltip = "text", width = 600, height = 600)

# Adjust plotly settings 
interactive_plot <- interactive_plot %>%
  layout(
    margin = list(l = 60, r = 60, b = 60, t = 80),  # Adjust margins
    showlegend = FALSE  # Show legend
  )

# Display the interactive plot
interactive_plot

7.4 ZH -Additive STL

The additive time series decomposition of the monthly overnight stays for tourists coming from the Philippines to the canton of Zurich shows:

Click to show code
# Convert data to a time series object
tourism_ts <- tourism_data_zurich_philippines %>%
  arrange(Date) %>%
  # Ensure data is complete and monthly
  complete(Date = seq.Date(min(Date), max(Date), by = "month")) %>%
  replace_na(list(value = 0)) %>%  # Replace NA values if there are any
  # Create a time series object
  with(ts(value, frequency = 12, start = decimal_date(min(Date))))

# Decompose the time series
decomposed <- decompose(tourism_ts, type = 'multiplicative')

# Plot the decomposed components
decomposed
#> $x
#>       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
#> 2005   57   30   46   73   74   73   57   69   70  103   46   47
#> 2006   49   29   55   81   82  113  115   87  100  134   90   52
#> 2007   77   35   50   83  109   94   42   52  112   66   45   48
#> 2008  122   55   58   82   76   58   46   53   58   70   35   45
#> 2009   76   35   95   88   89   90  100   96  111   81   75   46
#> 2010   52   66   83  127  164  125  106  105  139  219  110   70
#> 2011  103   92   86  177  161  189  126  175  169  201  195   72
#> 2012  120  131  138  208  286  191  110  123  227  247  145  110
#> 2013  181   66  159  233  239  182  136  128  156  205  174  104
#> 2014   96   89  138  287  316  223  225  194  326  340  158  230
#> 2015  109  139  239  290  478  295  326  273  320  388  221  269
#> 2016  201  202  297  556  442  517  343  278  387  495  268  383
#> 2017  202  344  282  810  746  437  560  377  459  519  280  513
#> 2018  178  358  368  563  657  536  461  376  446  498  370  518
#> 2019  201  161  209  706  661  731  576  401  442  562  334  749
#> 2020  202  133   76    3    3    9   35   24   18   14   13   22
#> 2021   14    9   10   13   18   18   49   56   78   68   74  132
#> 2022   78   51   85  142  282  516  473  452  578  850  655 1159
#> 2023  466  344  537 1096 1071 1120 1036  586  827               
#> 
#> $seasonal
#>        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct
#> 2005 0.760 0.576 0.751 1.182 1.287 1.128 1.016 0.952 1.195 1.326
#> 2006 0.760 0.576 0.751 1.182 1.287 1.128 1.016 0.952 1.195 1.326
#> 2007 0.760 0.576 0.751 1.182 1.287 1.128 1.016 0.952 1.195 1.326
#> 2008 0.760 0.576 0.751 1.182 1.287 1.128 1.016 0.952 1.195 1.326
#> 2009 0.760 0.576 0.751 1.182 1.287 1.128 1.016 0.952 1.195 1.326
#> 2010 0.760 0.576 0.751 1.182 1.287 1.128 1.016 0.952 1.195 1.326
#> 2011 0.760 0.576 0.751 1.182 1.287 1.128 1.016 0.952 1.195 1.326
#> 2012 0.760 0.576 0.751 1.182 1.287 1.128 1.016 0.952 1.195 1.326
#> 2013 0.760 0.576 0.751 1.182 1.287 1.128 1.016 0.952 1.195 1.326
#> 2014 0.760 0.576 0.751 1.182 1.287 1.128 1.016 0.952 1.195 1.326
#> 2015 0.760 0.576 0.751 1.182 1.287 1.128 1.016 0.952 1.195 1.326
#> 2016 0.760 0.576 0.751 1.182 1.287 1.128 1.016 0.952 1.195 1.326
#> 2017 0.760 0.576 0.751 1.182 1.287 1.128 1.016 0.952 1.195 1.326
#> 2018 0.760 0.576 0.751 1.182 1.287 1.128 1.016 0.952 1.195 1.326
#> 2019 0.760 0.576 0.751 1.182 1.287 1.128 1.016 0.952 1.195 1.326
#> 2020 0.760 0.576 0.751 1.182 1.287 1.128 1.016 0.952 1.195 1.326
#> 2021 0.760 0.576 0.751 1.182 1.287 1.128 1.016 0.952 1.195 1.326
#> 2022 0.760 0.576 0.751 1.182 1.287 1.128 1.016 0.952 1.195 1.326
#> 2023 0.760 0.576 0.751 1.182 1.287 1.128 1.016 0.952 1.195      
#>        Nov   Dec
#> 2005 0.857 0.970
#> 2006 0.857 0.970
#> 2007 0.857 0.970
#> 2008 0.857 0.970
#> 2009 0.857 0.970
#> 2010 0.857 0.970
#> 2011 0.857 0.970
#> 2012 0.857 0.970
#> 2013 0.857 0.970
#> 2014 0.857 0.970
#> 2015 0.857 0.970
#> 2016 0.857 0.970
#> 2017 0.857 0.970
#> 2018 0.857 0.970
#> 2019 0.857 0.970
#> 2020 0.857 0.970
#> 2021 0.857 0.970
#> 2022 0.857 0.970
#> 2023            
#> 
#> $trend
#>        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct
#> 2005    NA    NA    NA    NA    NA    NA  61.8  61.4  61.7  62.4
#> 2006  69.2  72.3  74.3  76.9  80.0  82.0  83.4  84.8  84.9  84.7
#> 2007  82.5  78.0  77.0  74.7  70.0  67.9  69.6  72.3  73.5  73.8
#> 2008  68.2  68.4  66.2  64.1  63.8  63.3  61.3  58.5  59.2  61.0
#> 2009  67.2  71.3  75.3  78.0  80.1  81.8  80.8  81.1  81.9  83.0
#> 2010  94.1  94.7  96.2 103.2 110.4 112.8 116.0 119.2 120.4 122.6
#> 2011 130.6 134.3 138.5 139.0 141.8 145.4 146.2 148.5 152.3 155.8
#> 2012 167.0 164.2 164.4 168.7 168.6 168.1 172.2 172.0 170.2 172.1
#> 2013 169.6 170.9 168.1 163.4 162.9 163.8 160.0 157.5 157.5 158.9
#> 2014 174.7 181.2 191.0 203.7 208.7 213.2 219.0 221.7 228.0 232.3
#> 2015 256.1 263.6 266.7 268.4 273.0 277.3 282.7 289.2 294.2 307.7
#> 2016 335.0 336.0 339.0 346.2 352.6 359.3 364.1 370.1 375.4 385.3
#> 2017 423.6 436.8 443.9 447.9 449.4 455.3 459.8 459.3 463.5 456.8
#> 2018 443.2 439.0 438.5 437.0 439.9 443.9 445.0 437.8 423.0 422.3
#> 2019 449.6 455.5 456.3 458.8 460.0 468.1 477.8 476.7 470.0 435.1
#> 2020 268.3 230.0 196.7 156.2 120.0  76.3  38.2  25.2  17.2  14.9
#> 2021  17.9  19.8  23.7  28.4  33.2  40.3  47.6  52.0  56.9  65.4
#> 2022 151.9 186.1 223.4 276.8 333.6 400.6 459.6 488.0 519.0 577.6
#> 2023 756.9 785.9 801.9    NA    NA    NA    NA    NA    NA      
#>        Nov   Dec
#> 2005  63.1  65.1
#> 2006  86.0  86.3
#> 2007  72.4  69.5
#> 2008  61.8  63.7
#> 2009  87.8  92.4
#> 2010 124.5 127.1
#> 2011 162.3 167.6
#> 2012 171.2 168.9
#> 2013 164.4 169.3
#> 2014 239.2 248.9
#> 2015 317.3 325.1
#> 2016 408.6 417.9
#> 2017 442.8 443.2
#> 2018 428.4 436.7
#> 2019 378.4 320.9
#> 2020  16.0  17.0
#> 2021  81.8 113.5
#> 2022 650.2 708.2
#> 2023            
#> 
#> $random
#>         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep
#> 2005     NA     NA     NA     NA     NA     NA 0.9081 1.1808 0.9495
#> 2006 0.9316 0.6966 0.9857 0.8914 0.7962 1.2214 1.3563 1.0772 0.9862
#> 2007 1.2280 0.7800 0.8651 0.9404 1.2102 1.2273 0.5935 0.7551 1.2755
#> 2008 2.3536 1.3976 1.1678 1.0825 0.9248 0.8126 0.7389 0.9516 0.8200
#> 2009 1.4861 0.8530 1.6809 0.9550 0.8632 0.9758 1.2171 1.2429 1.1342
#> 2010 0.7268 1.2108 1.1488 1.0414 1.1541 0.9824 0.8993 0.9255 0.9666
#> 2011 1.0373 1.1899 0.8272 1.0773 0.8820 1.1525 0.8478 1.2374 0.9286
#> 2012 0.9449 1.3864 1.1182 1.0428 1.3177 1.0077 0.6284 0.7509 1.1163
#> 2013 1.4036 0.6711 1.2599 1.2062 1.1398 0.9851 0.8360 0.8538 0.8289
#> 2014 0.7226 0.8535 0.9625 1.1919 1.1763 0.9273 1.0106 0.9193 1.1971
#> 2015 0.5596 0.9161 1.1940 0.9140 1.3598 0.9434 1.1343 0.9915 0.9103
#> 2016 0.7889 1.0447 1.1673 1.3587 0.9736 1.2758 0.9267 0.7890 0.8630
#> 2017 0.6271 1.3683 0.8463 1.5299 1.2893 0.8511 1.1984 0.8621 0.8289
#> 2018 0.5281 1.4167 1.1181 1.0898 1.1600 1.0708 1.0191 0.9021 0.8827
#> 2019 0.5879 0.6142 0.6101 1.3017 1.1162 1.3847 1.1860 0.8836 0.7873
#> 2020 0.9901 1.0045 0.5148 0.0163 0.0194 0.1046 0.9022 1.0017 0.8734
#> 2021 1.0276 0.7884 0.5629 0.3870 0.4210 0.3957 1.0131 1.1311 1.1480
#> 2022 0.6752 0.4762 0.5068 0.4340 0.6566 1.1421 1.0125 0.9729 0.9322
#> 2023 0.8097 0.7605 0.8921     NA     NA     NA     NA     NA     NA
#>         Oct    Nov    Dec
#> 2005 1.2441 0.8514 0.7444
#> 2006 1.1921 1.2224 0.6211
#> 2007 0.6743 0.7259 0.7119
#> 2008 0.8652 0.6613 0.7285
#> 2009 0.7354 0.9974 0.5133
#> 2010 1.3469 1.0312 0.5678
#> 2011 0.9727 1.4028 0.4429
#> 2012 1.0819 0.9888 0.6714
#> 2013 0.9726 1.2359 0.6332
#> 2014 1.1035 0.7713 0.9524
#> 2015 0.9505 0.8131 0.8529
#> 2016 0.9685 0.7658 0.9446
#> 2017 0.8566 0.7383 1.1931
#> 2018 0.8891 1.0083 1.2226
#> 2019 0.9738 1.0305 2.4057
#> 2020 0.7076 0.9511 1.3372
#> 2021 0.7842 1.0568 1.1988
#> 2022 1.1095 1.1761 1.6868
#> 2023                     
#> 
#> $figure
#>  [1] 0.760 0.576 0.751 1.182 1.287 1.128 1.016 0.952 1.195 1.326
#> [11] 0.857 0.970
#> 
#> $type
#> [1] "multiplicative"
#> 
#> attr(,"class")
#> [1] "decomposed.ts"

# Perform STL decomposition
stl(tourism_ts, s.window = "periodic") %>% 
  autoplot() +
  labs(x = "Date", y = "Number of tourists", title = "STL Decomposition for number of tourists from Philippines to Zurich") +
  theme_minimal()

7.5 ZH - Seasonality

Seasonal sub-series plot permit to better visualize the monthly fluctuations of each year, from 2005 to 2023.

Click to show code
# Plot the seasonality in one chart
ggseasonplot(tourism_ts, year.labels = TRUE, year.labels.left = TRUE) +
  scale_color_viridis_d() +
  theme_minimal()

We clearly observe here that the seasonality is not constant over time.

7.6 ZH - Naive Forecast

The graph shows the historical trend in the number of tourists from the Philippines to Zurich and the forecasts for the next 15 months using the naive model. We took the graph representing the total number of tourists coming from the Philippines to Zurich.

Click to show code
plot_naive

This naive model predicts that future values will be equal to the last observed value in the time series. It does not take into account the past events like the pandemic and assumes here that the levels observed after this extreme fall will remain unchanged. The model does not take into account trends or seasonality neither, which are very present in our case. It’s a simplified approach.

The blue areas represent the 80% (darker) and 95% (lighter) confidence intervals of the forecasts. The wider the interval, the greater the uncertainty about the long-term forecasts which is the case here.

We can see here the metrics of the naive model :

Click to show code
metrics_naive
Naive Model Metrics
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
NAIVE Training 0.012 0.5 0.369 -1.03 8.9 0.553 0.434 -0.103

7.7 ZH - ETS AAA model

Click to show code
plot_ets_AAA

Clearly see here that the confidence interval is too big, almost like a naive forecast, therefore as suspected we will move to a multiplicative seasonality model.

We see here the metrics of the ETS AAA model :

Click to show code
metrics_ets_AAA
ETS AAA Metrics
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
ETS_M_seaso Training 0.015 0.449 0.295 -1.29 7.86 0.442 0.39 0.078

7.8 ZH - Multiple Regression Analysis

Simple yet effective, if the relationships between the covariates and the dependent variable (tourist numbers) are linear.

Here is a summary of the effect of the variables on the number of tourists from the Philippines to Zurich :

Click to show code
# Fit a multiple regression model
model <- lm(value ~ avg_stringency_index_PH + avg_stringency_index_SW, data = merged_data)

# Summary of the model
summary(model)
#> 
#> Call:
#> lm(formula = value ~ avg_stringency_index_PH + avg_stringency_index_SW, 
#>     data = merged_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -318.9 -157.3  -69.3   93.7  873.7 
#> 
#> Coefficients:
#>                         Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)               246.32      15.85   15.54   <2e-16 ***
#> avg_stringency_index_PH     5.87       2.55    2.30   0.0221 *  
#> avg_stringency_index_SW   -12.19       3.73   -3.26   0.0013 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 220 on 222 degrees of freedom
#> Multiple R-squared:  0.0931, Adjusted R-squared:  0.0849 
#> F-statistic: 11.4 on 2 and 222 DF,  p-value: 1.95e-05

# Forecast the next 24 months
forecast_values <- predict(model, newdata = merged_data)

#us gtsummary to show the summary of the model
model %>%
  gtsummary::tbl_regression()
Characteristic Beta 95% CI1 p-value
avg_stringency_index_PH 5.9 0.85, 11 0.022
avg_stringency_index_SW -12 -20, -4.8 0.001
1 CI = Confidence Interval

We observe that the p-value are lower for Switzerland than for Philippines. This means that government stringency index in Switzerland has a more significant impact on the number of tourists than in the Philippines. Indeed, the beta of -12 for Switzerland and 5.9 for Philippines shows that the number of tourists in Switzerland decreases by 12 for each unit of stringency index, which seems logical. However the number of tourists in the Philippines increases by 5.9 for each unit of stringency index, which is counter-intuitive. This could be due to the fact that the stringency index is not the only factor that influences the number of tourists from the Philippines to Zurich.

We can observe here the metrics:

Click to show code
#create a table with the metrics of the model and show it as an html
model_metrics <- model %>%
  broom::glance()

# show html table with metrics
model_metrics %>%
  kable("html", caption = "Model Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Model Metrics
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.093 0.085 220 11.4 0 2 -1531 3070 3084 10734309 222 225

BLABLABLA explain metrics The way higher AIC shows that this model is not the best.